Chromosome-level genome assembly and annotation of the prickly nightshade Solanum rostratum Dunal

The prickly nightshade Solanum rostratum, an annual malignant weed, is native to North America and has globally invaded 34 countries, causing serious threats to ecosystems, agriculture, animal husbandry, and human health. In this study, we constructed a chromosome-level genome assembly and annotation of S. rostratum. The contig-level genome was initially assembled in 898.42 Mb with a contig N50 of 62.00 Mb from PacBio high-fidelity reads. With Hi-C sequencing data scaffolding, 96.80% of the initially assembled sequences were anchored and orientated onto 12 pseudo-chromosomes, generating a genome of 869.69 Mb with a contig N50 of 72.15 Mb. We identified 649.92 Mb (72.26%) of repetitive sequences and 3,588 non-coding RNAs in the genome. A total of 29,694 protein-coding genes were predicted, with 28,154 (94.81%) functionally annotated genes. We found 99.5% and 91.3% complete embryophyta_odb10 genes in the pseudo-chromosomes genome and predicted gene datasets by BUSCO assessment. The present genomic resource provides essential information for subsequent research on the mechanisms of environmental adaptation of S. rostratum and host shift in Colorado potato beetles.


Background & Summary
The prickly nightshade, Solanum rostratum Dunal (Solanales: Solanaceae), an annual plant, is an invasive alien malignant weed which classified as an "agricultural weed", an "environmental weed", and a "noxious weed" in the Global Compendium of Weeds 1 . In China, it is listed as an entry quarantine pest and key management alien invasive species. This species has a fast growth rate and strong reproductive ability, whose seed production reaching 78,500 seeds per plant 2 . High competitiveness in light, water, nutrients, ecological niche, and other resources results in reduced agricultural land production, loss of native species' competitive advantage, and decrease in biodiversity. In addition, the densely covered narrow and long prickles on the surface of the stem, leaf, calyx, and fruit can be mixed with fodder to hurt the oral cavity and gastrointestinal digestive tract of livestock. Moreover, the neurotoxin solanine present in whole plants can cause livestock poisoning 3 . It is also the host of the Colorado potato beetle Leptinotarsa decemlineata 4 , which is the most destructive pest on potatoes, the tomato golden mottle virus 5 , and the tomato severe leaf curl virus 6 . Thus, the invasion of S. rostratum seriously threatens the local ecological environment, agricultural production, grassland animal husbandry, biodiversity, and human health (Fig. 1).
The extremely strong ecological adaptability (could survive in the wasteland, grasslands, overgrazing pastures, roadside, garbage dumps, orchards, courtyards, irrigation ditches, and river beaches) 7 and stress resistance (barren, drought, wet, and salt 8 ) facilitate S. rostratum to spread and establish in a new environment as a dominant species. Native to North America 9 , S. rostratum is widely distributed in 34 countries and regions, including North America, Asia, Africa, Europe, and Oceania 10 . In China, since its first detection in 1981 in Chaoyang City, Paired-end short-reads. These results indicate that the present genome assemblies and annotations are contiguous and accurate. Furthermore, comparative genomic analysis was conducted with other nineteen solanaceous species to provide insight into their phylogenetic relationship, divergence time, whole-genome duplication (WGD) events along the solanaceous speciation, and genomic evolutionary history. Thus, the present S. rostratum genomic resource will be a foundation for subsequent research on this weed.

Methods
Plant material collection and preparation. Healthy mature plants of S. rostratum were collected from the wasteland in Chaoyang City, Liaoning Province, China (120.504360° E, 41.604752° N) in August 2021. After washing with deionized water, the roots, stems, leaves, flowers, and fruits were harvested. All the tissues were put into liquid nitrogen immediately and preserved in an ultra-low temperature freezer until use.
DNA library construction and genome sequencing. High molecular weight genomic DNA was extracted from tender leaves with a modified 2 × cetyltrimethylammonium bromide (CTAB) method 30 . Approximately 200 mg of tender leaves were ground to powder using liquid nitrogen and then added to 800 μL of CTAB lysis buffer in a 2.0-mL tube. After incubation at 65 °C for 60 min, 800 μL of phenol/ chloroform/ isopentanol (25:24:1) was added and centrifuged at 12,000 rpm for 10 min. The supernatant was extracted into another 2.0-mL tube with an equal volume of chloroform/isopentanol (24:1). After mixing by gentle inversion, the tube was centrifuged at 12,000 rpm for 10 min. The supernatant was extracted to another tube with 0.6 times the  www.nature.com/scientificdata www.nature.com/scientificdata/ volume of precooled (−20 °C) isopropanol. After being placed at −20 °C for over 2 h, the tube was centrifuged at 12,000 rpm for 10 min. The pellet was washed twice with 75% ethanol and dissolved in 50 µL of DNase and RNase-free Water for further study.
For the Illumina whole-genome shotgun raw sequencing, the genomic DNA was randomly fragmented, and a library with an average insert size of 350 bp was constructed using the Illumina TruSeq Nano DNA Library Prep Kit (Illumina, USA) following the manufacturer's instructions. The library was sequenced on the Novaseq 6000 Platform set in the PE150 program, generating a total of 103.47 Gb of raw data. After filtering by fastp v0.12.4 31 with default to remove low quality and short reads and cut adapters and polyG, 102.14 Gb (113.69×) clean data were retained for the genome size estimation ( Table 1).
The PacBio Sequel II System, based on single-molecule real-time (SMRT) sequencing technology under the Circular Consensus Sequencing (CCS) model, was used for whole-genome sequencing. The DNA template was sheared by g-TUBE (Covaries, USA) to an average size of 15-20 kb, and the target DNA fragments were obtained using BluePippin TM Size-Selection System (Sage Science, USA). The library was constructed using SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, USA) following the procedure and loaded onto PacBio Sequel ™ Systems to read the sequence. Finally, approximately 366.02 Gb subreads were obtained with an average length of 13.59 kb and an N50 length of 15.25 kb after removing adaptors in polymerase reads (Table 1).

RNA library construction and transcriptome sequencing.
Total RNA was isolated from the roots, stems, leaves, flowers, and fruits, respectively, using the standard TRIzol protocol (Invitrogen, USA) 32 . Approximately 100 mg of tissue was ground to powder using liquid nitrogen, and then 1000 μL of TRIzol was added in a 2.0-mL tube. After allowing the solution to stand for approximately 5 min, 200 μL of chloroform was added, shaken vigorously for 30 s, and allowed to stand for 3 min. After centrifugation at 12,000 rpm for 15 min at 4 °C, the upper aqueous phase was extracted to another 1.5-mL tube with 500 μL of isopropanol and then mixed by gently inverting. After standing for approximately 10 min, the tube was centrifuged at 12,000 rpm for 10 min. The supernatant was removed, and the pellet was washed twice with 75% ethanol and dissolved in 50 µL of DNase and RNase-free Water for further study.
For the Illumina paired-end reads sequencing, the mRNA was synthesized to cDNA, and five libraries were constructed with an insertion size of 350 bp using a TruSeq RNA library preparation kit (Illumina, USA) following the manufacturer's instructions. Whole-genome shotgun raw sequencing was performed using the Novaseq 6000 Platform set in the PE150 program. In total, 32.91 Gb of clean data were generated from the RNA-seq library after filtering using fastp 31 (Table 1).
For Iso-seq under the CCS model, the RNA samples extracted from root, stem, leaves, flowers, and fruits were equally mixed for sequencing. cDNA was synthesized using a Clontech SMARTer PCR cDNA Synthesis Kit (Takara Biotechnology, China). Then, the SMRTbell library (cDNAs length over 4 kb) was constructed using the Pacific Biosciences SMRTbell template prep kit (Pacific Biosciences, USA) and sequenced on the Pacific Bioscience Sequel II platform. A total of 19.81 Gb subreads were obtained with an average length of 2,562 bp and an N50 length of 3,005 bp after removing adaptors in polymerase reads ( Table 1). The exported subreads were analyzed using packages of SMRT link v10.1, including highly accurate consensus sequence calling using package ccs v6.0.0 (https://github.com/PacificBiosciences/ccs), primer removal and demultiplexing using package lima v2.1.0 (https://github.com/pacificbiosciences/barcoding/), polyA tail and artificial concatemers removal using package isoseq3 v3.4.0 (https://github.com/PacificBiosciences/IsoSeq), and clustering and polishing using package isoseq3 v3.4.0. Finally, approximately 387.83 Mb high-quality consensus isoform sequences were generated with an average length of 3,843 bp.
Contig-level genome assembly. The in-built High-Quality Region Finder (HQRF) was used to identify the longest high-quality regain for each read of exported subreads according to the signal noise ratio (SNR). HiFi reads were then generated from filtered subreads using the CCS model of SMRT link v10.1 with the following parameters: --maxLength = 50000, --minPasses = 3, and --minPredictedAccuracy = 0.99. The sequences in fastq.gz were converted from the BAM file using bam2fastx v1.3.1 (https://github.com/pacificbiosciences/bam-2fastx/). 25.83 Gb (28.75×) of CCS reads were obtained with an average length of 15.34 kb and an N50 length of 15.78 kb (Table 1). Then, Hifiasm v0. 16.0 25 was used to assemble the genome into contigs with default parameters. To check for the potential contaminant sequences, assembled contigs were classified using Kraken2 against the www.nature.com/scientificdata www.nature.com/scientificdata/ custom database 33 . Four contigs were identified as bacteria (904,041 bp, 0.10%), which were flagged and removed from the final assembly. After removal, the final contig-level assembly was submitted to the NCBI independent contamination check to confirm the result, resulting in an 898.42 Mb contig-level genome consisting of 113 contigs and an N50 length of 62.00 Mb ( Table 2).
Hi-C library construction and pseudo-chromosome anchoring. Tender leaves were cut into approximately 2-cm 2 pieces for cellular protein cross-linking in 2% formaldehyde. The isolated DNA was purified, digested with Dpnii restriction enzyme, tagged with biotin-14-dCTP, sheared into 300-600 bp fragments, and blunt-end-repaired. Then, the Hi-C library was sequenced using the Illumina NovaSeq platform, which generated 100.16 Gb filtered clean data (113.63×) to anchor contigs into pseudo-chromosomes ( Table 1). The cleaned Hi-C sequencing data were aligned on the contig assembly using bowtie2 v2.2.5 34 to obtain the unique mapped paired-end reads using the following parameters:--very-sensitive -L 20--score-min L, -0.6, --0.2 --end-to-end --reorder --rg-id BMG --phred33-quals -p 5. Quality control of read alignment and pairing was conducted using HiC-Pro v2.7.8 26 to discard low-quality alignment, singleton, multiple hits, and invalid pairs. A total of 156,223,644 valid paired-end reads were used to build the interaction matrices and scale up the primary genome assembly in contigs to chromosome-scale scaffolds (pseudo-chromosomes). A total of 869.69 Mb of the contig-level assembled sequences (96.80% anchored rate) were anchored and orientated onto 12 pseudo-chromosomes, which was consistent with the karyotype (2n = 24) analysis 35 , with lengths ranging from 63.15 to 92.28 Mb (Table 3). In summary, the size of the pseudochromosome-level S. rostratum genome that was obtained was 869.69 Mb with 212 unanchored contigs (total length 28.73 Mb), with a contig N50 of 72.15 Mb (Table 2). To validate the correction of the pseudo-chromosome anchoring result, the pseudo-chromosomes were divided into bins of equal size in 50 kb to construct genome-wide interaction matrices based on the interaction signals between each pair of bins. The interaction matrix heatmap was visualized using HiCPlotter v0.6.6 36 (Fig. 3).

Genome annotation and functional prediction.
Identifying repeat sequences. The repeat sequences in the genome were identified using a combination of homologous sequence prediction and ab initio prediction. For homologous sequence prediction, RepeatMasker v1.323 37 and RepeatProteinMask v1.36 38 were used to predict the homology sequences against known repeat sequences in the database RepBase 39 . For ab initio prediction, RepeatModeler open-1.0.8 40 was used to establish a de novo repeat sequence database, and RepeatMasker v1.323 37 was used for prediction. Tandem Repeats Finder (TRF) v4.07b 41 was used to find tandem repeat sequences in the genome. Combined with the results, 649.21 Mb repeat sequences were identified, accounting for 72.26% of the S. rostratum genome. The four predominant categories were long terminal repeats (LTR) (accounting for 46.06% of genome size), long interspersed nuclear elements (LINE) (3.62%), DNA elements (3.14%), and short interspersed nuclear elements (SINE) (0.22%) ( Table 4).
Identifying non-coding RNA (ncRNA) gene. Rfam 42 was used to predict ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), and micro RNAs (miRNAs) by comparison with known non-coding RNA libraries. Transfer RNAs (tRNAs) were predicted using tRNAscan-SE v1.3.1 43 . In total, 3,588 ncRNAs were annotated in the S. rostratum genome, including 547 miRNAs, 1,288 tRNAs, 1,110 rRNAs, and 643 snRNAs (Table 5). www.nature.com/scientificdata www.nature.com/scientificdata/ Gene structure prediction. Three strategies were applied to predict the gene structure from the repeat-masked genome. The first strategy was homologous prediction. BLAST v2.2.28 44 with an E-value cutoff of 1e-5 and GeMoMa v1.6 45 were used to predict gene structure by comparing with seven closely related species (C. annuum 19 , Solanum chilense 46 , Solanum commersonii 47 , S. lycopersicum 16 , S. melongena 20 , Solanum pennellii 48 , and S. tuberosum 18 ). The second strategy was based on transcriptome data. The filtered Illumina RNA-seq sequences from five libraries were assembled into transcripts using Trinity v2.11.0 49 with default parameters. Then, the Trinity RNA-Seq assemblies and full-length cDNAs were aligned and mapped to the soft-masked genome   www.nature.com/scientificdata www.nature.com/scientificdata/ assembly using GMAP v2014-10-2 50 and BLAT Src35 51 . Candidate gene structures were extracted from the PASA v2.1 28 pipeline based on the open reading frame (ORF). The third strategy included using ab initio prediction based on the characteristics of the genomic sequence data. Using Augustus v3.3 52 , SNAP v38926 53 , and GeneMark v4.33 54 , 29,485, 33,190, and 26,142 protein-coding genes were identified, respectively. Finally, EVM v1.11 27 integrated the above three strategies, resulting in a non-redundant gene set, with weighting as default. Overall, 29,694 protein-coding genes were obtained, with an average gene length of 4,308 bp, cds length of 1,172 bp, exon length of 237 bp, and intron length of 795 bp (Table 6).
Solanaceous orthology identification, phylogenetic tree construction, and divergence time estimation. Twenty solanaceous species were selected for comparative genomic analysis, with Ipomoea trifida as the outgroup. The longest transcripts, which were extracted using TBtools v1.106 63 , were used as the gene set for the following analysis. The orthogroups and orthologs classification were identified using Orthofinder v2.5.4 64 with parameters -S diamond, -M msa, and -T fasttree. As a result, 824,030 genes (93.10% of total genes) were assigned to 56,426 orthogroups among 21 species, with 7,963 orthogroups shared in all the species and 799 shared single-copy orthogroups. Among the 29,694 genes in S. rostratum, 28,514 were clustered into 17,237 orthogroups, with 12,096 genes in single-copy orthologs, 16,418 genes in multiple-copy orthologs, and 1,065 genes in 298 species-specific orthogroups.
A phylogenetic tree was constructed using the concatenated 799 single-copy orthogroup gene alignment generated using Orthofinder 64 . The maximum-likelihood method software raxmlHPC v8.2.12 65 was implemented with the parameters -m PROTGAMMAJTT, -f a, and -# 100. The solanaceous tree recovered the monophyly of 3 subfamilies, 5 tribes, and 6 genera with 100 support values at all nodes, revealing a sister group relationship between S. rostratum and S. melongena + S. aethiopicum (Fig. 4a).

WGD analysis.
To investigate the WGD event history of Solanaceae, the synonymous substitution rate (Ks) frequency density distributions of syntenic orthologous block pairwise between genomes and syntenic paralogous block pairwise within genomes were calculated by wgd v1.1.2 68 , including S. rostratum (Sros), Vitis vinifera 69 (Vvin), Ipomoea trifida 70 (Itri), S. lycopersicum 16 (Slyc), and S. melongena 20 (Smel). For one-versus-one orthologs Ks distributions calculation, the module dmd was implemented to extract orthologs by all-versus-all blastp using the diamond 71 algorithm with the parameters-nostrictcds -e 1e-10. The module ksd 66 was then used to construct one-versus-one ortholog Ks distributions. For whole-paranome Ks distribution calculation, the module dmd was used to extract paralogs and cluster gene families using the Markov cluster (MCL) 72 algorithm. Then, the module ksd 66 was used to construct whole-paranome Ks distributions with the parameter -mp 1000. Finally, the module syn identified and extracted paralogs in intra-genomic colinear blocks using i-ADHoRe v3.0 73 . A shared peak was detected within Solanaceae at approximately 0.68, which occurred after the divergence peak with V. vinifera, and before the Solanaceae speciation peak, indicating that an an ancient WGD occurred in the ancestor of the Solanaceae. However, there was no subsequent WGD after species differentiation within the Solanaceae. Within Solanaceae, S. rostratum first diverged from S. lycopersicum at 0.14, and S. melongena at 0.03 (Fig. 4b).
Whole-genome synteny. To understand the extend of genomic rearrangement of S. rostratum during evolution, whole-genome synteny analysis was conducted between S. rostratum (Sros) and S. lycopersicum 16 (Slyc), and between S. rostratum and S. melongena 20 (Smel). The protein sequences of Sros and Slyc, and Sros and Smel were blasted using blastp with parameter -evalue 1e-5. The multiple alignments of syntenic blocks were identified by MCScanX 74 with the parameter -s 15 (number of genes required to call a collinear block) and visualized by jcvi v1.2.8 75 with the parameter-minspan = 30. The complicated conserved syntenic blocks among the twelve pseudo-chromosomes, indicate that visible genome rearrangements occurred during evolution among Solanum (Fig. 4c).  Table 7. Statistics for the Solanum rostratum functionally annotated protein-coding genes.
The final chromosome-level assembled genome sequences were deposited in the NCBI Assembly database under Accession Number JARACL000000000 85 .
The genome annotation results, including repeated sequences, gene structure, and functional predictions were deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.22016024) 86 . evaluating the completeness and quality of the genome assembly and annotation. Flow cytometry analysis. FACScalibur Flow cytometry (BD Biosciences, USA) analysis 87 was conducted to estimate the S. rostratum genome size with three replicates, and ModFit software v5.0 (Yerity SoftwareHouse, USA) was used to analyze the results. The genome size of the internal reference standard Glycine max is 978.4 Mb (1 pg DNA = 0.978 G) 88 . The 2 C DNA content in pg of S. rostratum was calculated according to the following formula 89  Mapped to the genome using Illumina data. Illumina paired-end reads were mapped back to the draft genome using Burrows-Wheeler Aligner (BWA) v0.7.9a 90 . Then, depth, mapping rates, and coverage at each position were calculated using samtools v0.1.19 91 . The results showed that 99.04% of read pairs were mapped to the genome with an average depth of 105.24 and a coverage rate of 97.54%, indicating high single-base concordance.
BUSCO assessment. The completeness of the contig-level genome, Hi-C pseudo-chromosome-level genome, and predicted gene datasets were further evaluated with BUSCO (default parameters) v5.1.2 29 based on the ortholog database embryophyta_odb10 (1,614 genes). The results were visualized by the python script gener-ate_plot.py of BUSCO, showing a high completeness level with 99.4%, 99.5%, and 91.3% complete genes found in the contig-level genome, Hi-C pseudo-chromosome-level genome, and predicted gene datasets, respectively (Fig. 5).
Protein coding genes comparison with close species. To determine the prediction accuracy and reliability, the distribution of mRNA length, CDS length, exon length, intron length, and exon number in S. rostratum and other closely related species (C. annuum 19 , S. chilense 46 , S. commersonii 47 , S. lycopersicum 16 , S. melongena 20 , S. pennellii 48 , and S. tuberosum 18 ) were determined. The consistent distribution tendency among all species further supported an ideal annotated gene dataset in S. rostratum (Fig. 6).
Hence, a high-quality completeness and accuracy S. rostratum genome was assembled and annotated in the present study.